Introduction

Anime has been on the rise ever since the pandemic forced people to isolate in their homes. The word is a Japanese abbreviation of “animation,” and outside Japan it is used to describe animation produced in Japan.

Though it originated in Japan, anime has exploded in popularity abroad. The Wall Street Journal reported that overseas sales by Japanese companies that focus on animation and on licensing characters for other goods generated nearly 10 billion in 2018 and accounted for half of the industry’s total revenue.[1] As another example, “Demon Slayer: Mugen Train” made $19.5 million, which broke American box-office records for a foreign-language film.[2] Anime has become a big deal in the Western part of the world.

Given its popularity, what is anime and how do people rate it? In this project I explore how genre affects anime ratings and test whether there is a difference in rating between the most popular genre and the best-rated genre.

Data

The data can be found in the 2019 section of the tidytuesday repository; the link is provided in a footnote.[3] Information on how the data was collected is also available at that link.

Clean

The original data set is messy, especially the genre column, where a single column holds all the genres of an anime. Luckily, the people in charge of tidytuesday provide code to clean up the data, so the cleaned version is used here.

library(tidyverse)  # read_csv, dplyr verbs, ggplot2, forcats
library(here)       # project-relative file paths

file<-here("data","anime.csv")
anime<-read_csv(file)
anime
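
To illustrate the kind of cleanup described above, here is a minimal sketch of splitting a multi-genre column into one row per genre. The toy rows and the comma-separated format are assumptions made for illustration; the raw file and the actual tidytuesday script linked in the footnote may differ.

```r
library(tibble)
library(tidyr)

# Hypothetical toy rows; the real data set has many more columns
raw <- tibble(
  name  = c("Show A", "Show B"),
  genre = c("Comedy, Mystery", "Action"),
  score = c(7.5, 6.8)
)

# Split the comma-separated genre string so each genre gets its own row
tidy <- separate_rows(raw, genre, sep = ",\\s*")
tidy
```

After the split, an anime with several genres appears once per genre, which is why the cleaned data contains repeated titles.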

Overall score distribution

Let’s see how anime is usually scored, using a histogram. We will use the score variable, in which users rate a show on a 1-10 scale where 1 is the worst rating and 10 is the best.

ggplot(anime,aes(x=score))+
  geom_histogram(bins=30,color="red",fill="darkblue")+
  labs(x="Score",y="Count",title="Distribution of Anime Ratings")+
  theme_classic()

The histogram shows that most anime scores fall around 7, with the highest peak at about 7.5. In this data set the average is 6.87, which is close to 7.
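
Because the cleaned data repeats an anime once per genre, an average taken over rows can differ from one taken over unique titles. The sketch below shows the two calculations side by side on hypothetical toy data; it is not a statement about which of the two produced the 6.87 figure.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: "Show A" appears twice because it has two genres
toy <- tibble(
  name  = c("Show A", "Show A", "Show B"),
  genre = c("Comedy", "Mystery", "Action"),
  score = c(8, 8, 5)
)

# Mean over every row: Show A is counted twice
row_mean <- mean(toy$score, na.rm = TRUE)

# Mean over unique titles: each anime is counted once
title_mean <- toy %>%
  distinct(name, score) %>%
  summarise(avg = mean(score, na.rm = TRUE)) %>%
  pull(avg)

c(row_mean = row_mean, title_mean = title_mean)
```

In this toy case the row-level mean is pulled toward the title that has more genres.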

Creating the genre table

The following code creates a table with one row for each genre of each anime. I did this because the cleaned data set contains many repeats of each anime, since the genre, studio, and director can differ across rows for a single anime.

genre_list<-anime %>% 
  group_by(genre,name) %>% 
  summarise(mean=mean(score,na.rm = TRUE)) %>%
  mutate(rank_mean=min_rank(-mean))

genre_count<-genre_list %>% 
  count() %>%
  ungroup() %>%
  mutate(usage_rank=min_rank(-n)) %>% 
  arrange(usage_rank)

genre_mean<-genre_list %>%
  ungroup() %>% 
  group_by(genre) %>% 
  summarise(score=mean(mean,na.rm=TRUE)) %>% 
  mutate(score_rank=min_rank(-score)) %>% 
  arrange(score_rank)

genre_anime<-genre_mean %>% 
  left_join(genre_count,by="genre")

At the end of the data wrangling, we have a table holding the mean score for each genre and the number of times the genre appears in the data set. The mean score and the usage count each have their own rank, which lets us find the most used genres and the best-scored genres.
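
As a small illustration of how the two ranks behave, here is a sketch on a hypothetical genre-level table. The genre names match the data set, but the numbers are made up.

```r
library(dplyr)
library(tibble)

# Hypothetical genre-level summary (made-up values)
toy <- tibble(
  genre = c("Comedy", "Mystery", "Action", "Police"),
  score = c(6.5, 7.1, 6.5, 7.0),
  n     = c(50, 10, 30, 5)
)

ranked <- toy %>%
  mutate(
    score_rank = min_rank(-score),  # 1 = highest mean score
    usage_rank = min_rank(-n)       # 1 = most frequently used
  ) %>%
  arrange(score_rank)
ranked
```

Note that min_rank() gives tied values the same rank, so a filter such as score_rank<=5 can return more than five genres when scores tie.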

Most Used Genres and Well-Scored Genres

After creating a table, let’s now filter it by finding the top 5 genres that are highly rated and the top 5 most used genres.

well_recieved_genre<-genre_anime %>% 
  filter(score_rank<=5)
ggplot(well_recieved_genre,aes(x=fct_reorder(genre,-score),y=n,fill=genre))+
  geom_col()+
  labs(x="Genre",y="Count",title="Top 5 Well-Scored Genres")+
  ylim(0,5500)+
  theme_minimal()+  # complete theme first, so the axis tweak below is not overwritten
  theme(axis.text.x = element_text(angle = 45,hjust=1))+
  scale_fill_discrete(name="Genre")+
  geom_text(aes(label=n),vjust=-.4)+  # count above the bar
  geom_text(aes(label=signif(score,digits = 3)),vjust=1.5)  # mean score just below the top of the bar

In this chart, the bars are ordered by average score, and the score is shown as the bottom number. The number on top is the number of times that genre appears in the data set. We see that Mystery, Josei, Thriller, Shounen, and Police are the top five genres.

Mystery and Thriller are similar to the genres of the same name in shows and movies, but the rest require some explanation.

Josei: A genre that centers on the interpersonal relationships of the characters or a realistic romance. It is usually aimed at girls and women from their late teens to early forties.[4]

Shounen: A genre that is usually heavily action-oriented, with friendship as one of its core themes. It is usually made for boys around the ages of 12-18.[5]

Police: A genre that emphasizes the challenges law enforcers face and their struggles at work.[6]

most_used_genre<-genre_anime %>% 
  filter(usage_rank<=5) %>% 
  arrange(usage_rank)
ggplot(most_used_genre,aes(x=fct_reorder(genre,-n),y=n,fill=genre))+
  geom_col()+
  labs(x="Genre",y="Count",title="Top 5 Most Commonly Used Genres")+
  ylim(0,5500)+
  theme_minimal()+  # complete theme first, so the axis tweak below is not overwritten
  theme(axis.text.x = element_text(angle = 45,hjust=1))+
  scale_fill_discrete(name="Genre")+
  geom_text(aes(label=n),vjust=-1)+  # count above the bar
  geom_text(aes(label=signif(score,digits = 3)),vjust=1)  # mean score just below the top of the bar

This graph shows the most used genres in the data set, ordered by count from greatest to least. The number on top is the count for the genre, and the number on the bottom is its mean score.

When we compare the graphs, we can see that the best-rated genres don’t even manage to break 2000 counts, unlike the most commonly used ones. The only one from the first graph that came close to 2000 was Shounen. Interestingly, the most common genres received scores around 6, while the best-scored genres received scores around 7. This difference in score made me want to look deeper into the genres.

I decided to look deeper into the most popular genre, Comedy, and the best-rated genre, Mystery.

What are the top anime in these genres?

To see some of the differences, I looked at the top five anime in each genre and created another histogram to see the distribution of scores within it.

comedy_list<- genre_list %>% 
  filter(genre=="Comedy")

comedy_list_sm<- genre_list %>% 
  filter(genre=="Comedy",
         rank_mean<=5) %>% 
  arrange(rank_mean)
comedy_list_sm
ggplot(comedy_list,aes(x=mean))+
  geom_histogram(bins=30,color="red",fill="darkblue")+
  labs(x="Score",y="Count",title="Distribution of Comedy Anime Score")+
  theme_classic()

In the Comedy genre, the top five include an anime with a perfect score, and the rest have scores around 9. However, the distribution shows that most of the scores are clumped between 6 and 7.

mystery_list<- genre_list %>% 
  filter(genre=="Mystery")

mystery_list_sm<- genre_list %>% 
  filter(genre=="Mystery",
         rank_mean<=5) %>% 
  arrange(rank_mean)
mystery_list_sm
ggplot(mystery_list,aes(x=mean))+
  geom_histogram(bins=30,color="red",fill="darkblue")+
  labs(x="Score",y="Count",title="Distribution of Mystery Score")+
  theme_classic()

For Mystery, the top 5 anime have high scores of around 8. The distribution shows that most Mystery scores lie between 7 and 8. Visually there seems to be a difference between the scores of Mystery and Comedy, so let’s check whether there is one using statistical inference.

Statistical Inference between Mystery and Comedy

We filter genre_list down to the Mystery and Comedy genres so we can run a Welch two-sample t-test.

two_genre<-genre_list %>% 
  # %in% keeps every row from either genre; == against a length-2 vector recycles it
  filter(genre %in% c("Mystery","Comedy"))
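
When filtering on more than one value, the choice between `==` and `%in%` matters. The short sketch below, on a made-up vector, shows how the two behave: `==` recycles the right-hand vector and compares position by position, while `%in%` tests each element against the whole set.

```r
genres <- c("Mystery", "Comedy", "Comedy", "Mystery")

# `==` recycles c("Mystery", "Comedy") along the vector and compares position by position
eq <- genres == c("Mystery", "Comedy")
eq

# `%in%` checks each element against the whole set
inn <- genres %in% c("Mystery", "Comedy")
inn
```

With `==`, rows that hold one of the two genres but sit in the "wrong" position are silently dropped, so a filter built that way keeps only part of the matching rows.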

Null hypothesis: The mean scores of Mystery and Comedy do not differ.

Alternative hypothesis: The mean scores of Mystery and Comedy differ.

Here we use an alpha value of .05 as our benchmark.

t.test(mean~genre,data=two_genre)
## 
##  Welch Two Sample t-test
## 
## data:  mean by genre
## t = -10.875, df = 402.97, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Comedy and group Mystery is not equal to 0
## 95 percent confidence interval:
##  -0.6964396 -0.4831987
## sample estimates:
##  mean in group Comedy mean in group Mystery 
##              6.527998              7.117818

In this test the p-value is reported as less than 2.2e-16, which is how R displays a p-value too small to print precisely. Since this is far below our alpha of .05, we can reject the null hypothesis: we have evidence that the mean scores of Mystery and Comedy differ. The test is two-sided, so formally it only tests whether the means differ, not which is larger. The sample estimates and the confidence interval in the output, however, indicate that Mystery’s mean score is roughly 0.5 to 0.7 points higher than Comedy’s.
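
For readers who want to pull the numbers out of a test result rather than read them off the printout, here is a minimal sketch on simulated scores (not the anime data). The group means and standard deviation are assumptions chosen only to resemble the setting above.

```r
set.seed(1)
# Simulated scores, not the anime data
a <- rnorm(200, mean = 6.5, sd = 0.8)
b <- rnorm(200, mean = 7.1, sd = 0.8)

res <- t.test(a, b)   # Welch two-sample test by default (var.equal = FALSE)

res$p.value           # the p-value stored in the result object
res$conf.int          # 95% confidence interval for mean(a) - mean(b)
res$estimate          # the two sample means

.Machine$double.eps   # about 2.2e-16; smaller p-values print as "< 2.2e-16"
```

The sign of the confidence interval shows the direction of the difference, which the two-sided p-value alone does not.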

Conclusion

Anime has become a big part of our media in recent years. As with TV shows, genre does have some effect on the scores people give. The average scores of Comedy and Mystery are different, but in reality that difference doesn’t matter much to the audience. What matters is that we find a good show to invest in and cherish, and that we create good memories that many anime fans, including me, can share with others to show the wonders of Japanese animation.


  1. https://www.wsj.com/articles/the-world-is-watching-more-animeand-streaming-services-are-buying-11605365629

  2. https://www.economist.com/business/2021/06/05/streaming-and-covid-19-have-entrenched-animes-global-popularity

  3. https://github.com/rfordatascience/tidytuesday/blob/master/data/2019/2019-04-23/readme.md

  4. https://japanesetactics.com/what-does-josei-mean-in-anime

  5. https://japanesetactics.com/what-does-shounen-mean-in-anime

  6. https://featuredanimation.com/anime-genres/